Tick occurrence¶

Trying to win the hackathon at OGH2023 by using Julia. By Maarten Pronk (@evetion).

Notebook to produce a tick occurrence prediction. Very simple and quick, with only three basic assumptions: ticks occur everywhere, but only at specific heights and ecotones, and humans get ticks on hiking paths. No thought has been given to temporal, weather or species patterns, let alone statistical models to account for the (positive) reporting bias. A partial assumption here is that it is very difficult to create a negative test set (absolutely no ticks; you can't prove a negative), so we essentially predict all locations where a tick bite could occur at any time.

Input:

  • Height map (correlates with colder temperatures and fewer ticks)
  • Rasterized OpenStreetMap dump of the (filtered) Swiss road set
  • Ecotone feature based on changes in landcover

Output:

  • Binary classification and its probability
In [1]:
using Rasters, GeoDataFrames, DataFrames, ArchGDAL, MLJ, Plots, StatsBase, GeoArrays
folder = "/Users/evetion/Downloads/prepared"
Out[1]:
"/Users/evetion/Downloads/prepared"

Input data¶

Mask¶

In [3]:
mask = Raster(joinpath(folder, "Mask/Mask.tif"))
plot(mask)
Out[3]:

Roads¶

Based on our simple assumption, we download the latest OSM dump for Switzerland from https://download.geofabrik.de/europe/switzerland.html, convert it to GeoPackage with ogr2ogr and read it in here. Note that GeoDataFrames could read the .pbf file directly, but that is much slower than the GeoPackage.
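The conversion step might look like this (file names are assumptions; GDAL's OSM driver exposes the road geometries in a `lines` layer):

```shell
# Convert the OSM PBF dump to a GeoPackage, keeping only the "lines" layer
ogr2ogr -f GPKG roads.gpkg switzerland-latest.osm.pbf lines
```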

In [6]:
df = GeoDataFrames.read(joinpath(folder, "roads.gpkg"), "lines")
Out[6]:
1589809×3 DataFrame
1589784 rows omitted
 Row │ geom                     highway        other_tags
     │ IGeometr…                String         String?
1Geometry: wkbLineStringmotorway_link"destination"=>"Interlaken; Kandersteg; Zweisimmen","oneway"=>"yes","ref"=>"A6","surface"=>"asphalt"
2Geometry: wkbLineStringsecondary"cycleway"=>"lane","lane_markings"=>"no","maxspeed"=>"50","maxspeed:type"=>"sign","oneway"=>"yes","sidewalk"=>"right","surface"=>"concrete"
3Geometry: wkbLineStringservice"surface"=>"paved"
4Geometry: wkbLineStringmotorway_link"destination"=>"Thun;Gunten","lanes"=>"1","maxspeed"=>"100","oneway"=>"yes","surface"=>"asphalt"
5Geometry: wkbLineStringmotorway_link"destination"=>"Bern","maxspeed"=>"60","oneway"=>"yes","surface"=>"asphalt"
6Geometry: wkbLineStringsecondary"bicycle"=>"use_sidepath","bridge"=>"yes","elevated"=>"+1","hazmat"=>"no","hgv"=>"no","lanes"=>"2","layer"=>"1","lit"=>"yes","maxspeed"=>"50","maxweight"=>"10","oneway"=>"yes","owner"=>"Kanton Zürich","shoulder"=>"no","sidewalk:left"=>"no","sidewalk:right"=>"separate","smoothness"=>"good","surface"=>"asphalt","wikidata"=>"Q1375301"
7Geometry: wkbLineStringresidential"lit"=>"yes","maxspeed"=>"30","sac_scale"=>"hiking","sidewalk"=>"left","surface"=>"asphalt","wikidata"=>"Q27329737"
8Geometry: wkbLineStringresidential"lanes"=>"1","surface"=>"asphalt"
9Geometry: wkbLineStringservice"surface"=>"asphalt"
10Geometry: wkbLineStringresidential"lit"=>"yes","surface"=>"asphalt"
11Geometry: wkbLineStringresidential"maxspeed"=>"30","mofa"=>"yes","motor_vehicle"=>"destination","surface"=>"asphalt"
12Geometry: wkbLineStringresidential"sidewalk"=>"right","surface"=>"asphalt"
13Geometry: wkbLineStringresidential"lanes"=>"1","lit"=>"yes","maxspeed"=>"30","sidewalk"=>"both","surface"=>"asphalt"
⋮⋮⋮⋮
1589798Geometry: wkbLineStringservice"service"=>"driveway"
1589799Geometry: wkbLineStringservice"service"=>"driveway"
1589800Geometry: wkbLineStringpathmissing
1589801Geometry: wkbLineStringpathmissing
1589802Geometry: wkbLineStringpathmissing
1589803Geometry: wkbLineStringpathmissing
1589804Geometry: wkbLineStringpathmissing
1589805Geometry: wkbLineStringservice"service"=>"driveway"
1589806Geometry: wkbLineStringtrack"bicycle"=>"designated","foot"=>"designated","motor_vehicle"=>"no","surface"=>"unpaved","tracktype"=>"grade3"
1589807Geometry: wkbLineStringtrack"bicycle"=>"designated","foot"=>"designated","motor_vehicle"=>"no","surface"=>"unpaved","tracktype"=>"grade3"
1589808Geometry: wkbLineStringtrack"bicycle"=>"designated","foot"=>"designated","layer"=>"-1","motor_vehicle"=>"no","surface"=>"unpaved","tracktype"=>"grade3","tunnel"=>"yes"
1589809Geometry: wkbLineStringtrack"bicycle"=>"no","foot"=>"no","horse"=>"yes","surface"=>"unpaved","tracktype"=>"grade2"

Ideally we would further filter this dataset to only contain hiking footpaths, but finding the perfect combination of tags would take too much time. Let's rasterize these 1.5M roads onto our mask file.

In [8]:
rl = rasterize(count, df.geom, to=mask, threaded=true)
plot(rl)
Rasterizing... 100%|██████████████████████████████████████████████████| Time: 0:00:03
[ Info: 1 geometries did not affect any pixels. See `metadata(raster)["missed_geometries"]` for a vector of misses
Out[8]:

Ecotone map¶

Ticks like the transition between forest and more open landcovers. Let's approximate this by morphologically checking all landcover class borders. Here I use the ESA WorldCover 2021 dataset, clipped to Switzerland. I skipped the provided 72-class dataset, as it was too detailed for my use case here.

In [11]:
landcover = Raster(joinpath(folder, "esawc.tif"))
plot(landcover)
Out[11]:
In [13]:
# Faster version from https://github.com/JuliaImages/ImageFiltering.jl/issues/179
function mapwindow2(f, img, window)
   out = zeros(eltype(img), axes(img))
   R = CartesianIndices(img)
   I_first, I_last = first(R), last(R)
   Δ = CartesianIndex(ntuple(x->window ÷ 2, ndims(img)))
   @inbounds @simd for I in R
       patch = max(I_first, I-Δ):min(I_last, I+Δ)
       out[I] = f(view(img, patch))
   end
   return out
end
Out[13]:
mapwindow2 (generic function with 1 method)

We first make a map of the number of unique landcover classes in a 3×3 window. Higher counts indicate transitions.

In [14]:
using ImageFiltering
nlc = mapwindow2(x->length(unique(x)), landcover, 3);
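To see what this computes, here is a toy run on a hypothetical 4×4 landcover patch (values made up for illustration; `mapwindow2` is repeated so the snippet runs on its own):

```julia
# Count unique landcover classes in each 3x3 window (same logic as the cell above).
function mapwindow2(f, img, window)
    out = zeros(eltype(img), axes(img))
    R = CartesianIndices(img)
    I_first, I_last = first(R), last(R)
    Δ = CartesianIndex(ntuple(x -> window ÷ 2, ndims(img)))
    for I in R
        patch = max(I_first, I - Δ):min(I_last, I + Δ)
        out[I] = f(view(img, patch))
    end
    return out
end

lc = [10 10 30 30;   # 10 = tree cover, 30 = grassland (ESA WorldCover codes)
      10 10 30 30;
      10 10 10 30;
      10 10 10 10]
nclasses = mapwindow2(x -> length(unique(x)), lc, 3)
# Cells deep inside one class get 1; cells along the 10/30 border get 2.
```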

Then we define our own window function (accepting only a 3x3 window) to find transitions between forest and open field-like landcovers, including water and urban.

In [56]:
"""
This returns true if the middle cell of a 3x3 window is forest
and any of the surrounding pixels are grassland, shrubland, cropland, urban or water.
There are more classes, but they don't occur in Switzerland in large numbers.
"""
function ecotone_esa(view)
    view[2,2] == 10 && (20 in view || 30 in view || 40 in view || 50 in view || 80 in view)
end
ecotone = mapwindow2(ecotone_esa, landcover, 3)
plot(ecotone)
Out[56]:
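A quick sanity check of the window function on hand-made 3×3 windows (toy values, assuming ESA codes 10 = tree cover, 30 = grassland; the function is repeated so the check runs standalone):

```julia
# Identical logic to the ecotone_esa cell above.
function ecotone_esa(w)
    w[2, 2] == 10 && (20 in w || 30 in w || 40 in w || 50 in w || 80 in w)
end

edge     = [10 10 30;
            10 10 10;
            10 10 10]      # forest centre touching grassland -> ecotone
interior = fill(10, 3, 3)  # pure forest -> not an ecotone
grass    = fill(30, 3, 3)  # centre is not forest -> not an ecotone
```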
In [58]:
# Resample the derived datasets above to the mask grid for easy sampling later on
landcoverl = resample(landcover, to=mask)

# Copy first, to work around a small bug in Rasters.jl
ec = copy(landcover)
ec .= ecotone
ecotonel = resample(ec, to=mask)

nlcc = copy(landcover)
nlcc .= nlc
nlcl = resample(nlcc, to=mask);

Elevation¶

Elevation is a useful feature because ticks do not occur on glaciers and high mountains, due to the cold and the lack of vegetation.

In [19]:
elevation = Raster(joinpath(folder, "DHM/DHM25_2056.tif"))
elevationl = resample(elevation, to=mask)
plot(elevationl)
Out[19]:

Training data¶

Read in the training points. Since these are all positive occurrences (tick bite = yes), we are missing negative samples (tick bite = no). Of course, these don't exist, so we need to invent them. This is the most influential decision for this hackathon, and a similar decision needs to be made for judging. Without negative samples our model would simply predict ticks everywhere.

In [26]:
training = GeoDataFrames.read(joinpath(folder, "../tick_reports_training.gpkg"))
Out[26]:
28446×6 DataFrame
28421 rows omitted
 Row │ geom                 id     acc    datetime             X          Y
     │ IGeometr…            Int32  Int32  DateTime             Float64    Float64
1Geometry: wkbPoint48742015-03-11T21:09:272.70332e61.23318e6
2Geometry: wkbPoint52732015-03-11T20:39:012.70082e61.2336e6
3Geometry: wkbPoint592572015-03-13T07:54:012.55911e61.15314e6
4Geometry: wkbPoint631902015-03-14T16:38:272.60497e61.20022e6
5Geometry: wkbPoint663332015-03-15T18:59:352.68172e61.24709e6
6Geometry: wkbPoint672572015-03-14T20:42:362.65859e61.23487e6
7Geometry: wkbPoint69322015-03-16T09:41:242.55448e61.20329e6
8Geometry: wkbPoint702572015-03-17T12:46:122.5814e61.18908e6
9Geometry: wkbPoint712962015-03-16T19:28:282.68135e61.25081e6
10Geometry: wkbPoint731652015-03-17T11:53:332.72801e61.24543e6
11Geometry: wkbPoint752572015-03-18T20:48:282.67471e61.22108e6
12Geometry: wkbPoint76912015-03-21T11:12:212.67739e61.25329e6
13Geometry: wkbPoint80692015-03-23T07:16:372.50033e61.13898e6
⋮⋮⋮⋮⋮⋮⋮
28435Geometry: wkbPoint495421102020-11-29T10:59:492.55748e61.13255e6
28436Geometry: wkbPoint495431792020-11-29T20:26:052.70963e61.26002e6
28437Geometry: wkbPoint495523332020-12-18T19:45:022.64394e61.24363e6
28438Geometry: wkbPoint495533332020-12-20T05:43:342.60587e61.19233e6
28439Geometry: wkbPoint495543012020-12-24T08:09:442.53022e61.1674e6
28440Geometry: wkbPoint495553332020-12-24T16:09:172.61463e61.21286e6
28441Geometry: wkbPoint495571282020-12-27T10:17:322.71132e61.11632e6
28442Geometry: wkbPoint499102722020-06-02T06:48:582.71039e61.09232e6
28443Geometry: wkbPoint499492692020-07-04T06:45:502.69311e61.25481e6
28444Geometry: wkbPoint563729692018-09-30T10:12:592.65095e61.26869e6
28445Geometry: wkbPoint586233012016-08-20T17:53:092.62779e61.23675e6
28446Geometry: wkbPoint64904972019-04-20T16:05:222.611e61.19356e6

Absences¶

The training dataset above contains the accuracy of each reported measurement. To generate our negative samples, we randomly sample the same number of points as there are training observations, outside the reported points buffered by their accuracy.
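The sampling idea can be sketched on a toy grid (made-up counts; the notebook uses the rasterized buffered reports instead):

```julia
using Random
Random.seed!(1)  # reproducible toy example

counts = zeros(Int, 5, 5)           # stand-in for the rasterized report buffers
counts[2, 2] = 3; counts[4, 5] = 1  # cells covered by tick reports

free = findall(==(0), counts)       # CartesianIndex of all report-free cells
absences = shuffle(free)[1:4]       # draw pseudo-absences without replacement
```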

In [91]:
pols = buffer.(training.geom, training.acc)  # polygons
x = rasterize(count, pols, to=mask, missingval=0, threaded=false)

# All cell indices without any reports
loc = findall(x.==0)

# A random 28k sample
ns = sample(loc, nrow(training))

# Convert these cell indices (CartesianIndex) to actual geometry points
nc = DimPoints(mask)[ns]
ng = ArchGDAL.createpoint.(nc)
Rasterizing... 100%|██████████████████████████████████████████████████| Time: 0:00:01
[ Info: 3127 geometries did not affect any pixels. See `metadata(raster)["missed_geometries"]` for a vector of misses
Out[91]:
28446-element Vector{ArchGDAL.IGeometry{ArchGDAL.wkbPoint}}:
 Geometry: POINT (2641300 1214900)
 Geometry: POINT (2779700 1253300)
 Geometry: POINT (2666900 1089200)
 Geometry: POINT (2769700 1235700)
 Geometry: POINT (2741500 1136700)
 Geometry: POINT (2596200 1161700)
 Geometry: POINT (2794700 1211000)
 Geometry: POINT (2578900 1295900)
 Geometry: POINT (2679700 1184000)
 Geometry: POINT (2536200 1199800)
 Geometry: POINT (2648800 1246900)
 Geometry: POINT (2516400 1286600)
 Geometry: POINT (2805900 1227300)
 ⋮
 Geometry: POINT (2654100 1087300)
 Geometry: POINT (2754200 1263400)
 Geometry: POINT (2637800 1175800)
 Geometry: POINT (2771500 1191400)
 Geometry: POINT (2589500 1099800)
 Geometry: POINT (2717800 1079900)
 Geometry: POINT (2736900 1159500)
 Geometry: POINT (2785800 1238000)
 Geometry: POINT (2627100 1138700)
 Geometry: POINT (2503200 1258400)
 Geometry: POINT (2568600 1189100)
 Geometry: POINT (2645000 1180700)
In [92]:
# Create a new DataFrame that combines both occurrences and absences
ndf = DataFrame(geom=ng, class=0)
tdf = select(training, :geom)
tdf.class .= 1

adf = vcat(ndf, tdf)
GeoDataFrames.write("all.gpkg", adf, geom_columns=(:geom,))
Out[92]:
"all.gpkg"

We plot a random subset of points (plotting all of them takes too long with the default renderer) to validate the location of our newly added absences.

In [90]:
subset = sample(1:nrow(adf), 2000)
plot(adf.geom[subset], marker_z=adf.class[subset]', format=:png)
Out[90]:

Combining it all¶

With our training data now balanced, let's sample our features to create the actual training dataset for our model.

In [94]:
rs = RasterStack((;elevationl, roads=rl, landcover=landcoverl, ecotone=ecotonel, nlandcover=nlcl))
Out[94]:
RasterStack with dimensions: 
  X Projected{Float64} LinRange{Float64}(2.485e6, 2.8339e6, 3490) ForwardOrdered Regular Intervals crs: WellKnownText,
  Y Projected{Float64} LinRange{Float64}(1.2959e6, 1.075e6, 2210) ReverseOrdered Regular Intervals crs: WellKnownText
and 5 layers:
  :elevationl Float32 dims: X, Y (3490×2210)
  :roads      Int64 dims: X, Y (3490×2210)
  :landcover  UInt8 dims: X, Y (3490×2210)
  :ecotone    UInt8 dims: X, Y (3490×2210)
  :nlandcover UInt8 dims: X, Y (3490×2210)
In [95]:
dft = DataFrame(extract(rs, adf.geom))  # extract samples the rasterstack at point positions
dft.class = adf.class  # add class to the sampled dataframe
GeoDataFrames.write("dft.gpkg", dft)
dft
Out[95]:
56892×7 DataFrame
56867 rows omitted
 Row │ geometry             elevationl  roads  landcover  ecotone  nlandcover  class
     │ IGeometr…            Float32     Int64  UInt8      UInt8    UInt8       Int64
1Geometry: wkbPoint711.512130010
2Geometry: wkbPoint1287.07010010
3Geometry: wkbPointNaN030010
4Geometry: wkbPoint1015.06030010
5Geometry: wkbPoint2870.55060020
6Geometry: wkbPoint1261.65110010
7Geometry: wkbPoint1354.16030010
8Geometry: wkbPointNaN030020
9Geometry: wkbPoint1751.8130010
10Geometry: wkbPoint1145.73130010
11Geometry: wkbPoint396.5950030
12Geometry: wkbPointNaN010120
13Geometry: wkbPoint2028.4030010
⋮⋮⋮⋮⋮⋮⋮⋮
56881Geometry: wkbPoint388.371210131
56882Geometry: wkbPoint607.229410011
56883Geometry: wkbPoint441.858310121
56884Geometry: wkbPoint672.889210011
56885Geometry: wkbPoint476.691110121
56886Geometry: wkbPoint641.447010011
56887Geometry: wkbPoint580.942110011
56888Geometry: wkbPoint413.948110011
56889Geometry: wkbPoint521.935140011
56890Geometry: wkbPoint327.531710121
56891Geometry: wkbPoint425.202210011
56892Geometry: wkbPoint682.191030011

We need to let our ML model know that some features are categorical.

In [96]:
using ScientificTypes
dft.class = coerce(dft.class, Multiclass)
dft.landcover = coerce(dft.landcover, Multiclass)
dft.ecotone = coerce(dft.ecotone, Multiclass)

select!(dft, Not(:geometry))
schema(dft)
Out[96]:
┌────────────┬────────────────┬─────────────────────────────────┐
│ names      │ scitypes       │ types                           │
├────────────┼────────────────┼─────────────────────────────────┤
│ elevationl │ Continuous     │ Float32                         │
│ roads      │ Count          │ Int64                           │
│ landcover  │ Multiclass{10} │ CategoricalValue{UInt8, UInt32} │
│ ecotone    │ Multiclass{2}  │ CategoricalValue{UInt8, UInt32} │
│ nlandcover │ Count          │ UInt8                           │
│ class      │ Multiclass{2}  │ CategoricalValue{Int64, UInt32} │
└────────────┴────────────────┴─────────────────────────────────┘

Modeling¶

We use XGBoost (with default settings), as it is easy to use, fast, and performs well.

In [97]:
# Pkg.add("MLJXGBoostInterface")
Tree = @load XGBoostClassifier pkg=XGBoost
tree = Tree()  # no hypertuning yet
[ Info: For silent loading, specify `verbosity=0`. 
import MLJXGBoostInterface ✔
Out[97]:
XGBoostClassifier(
  test = 1, 
  num_round = 100, 
  booster = "gbtree", 
  disable_default_eval_metric = 0, 
  eta = 0.3, 
  num_parallel_tree = 1, 
  gamma = 0.0, 
  max_depth = 6, 
  min_child_weight = 1.0, 
  max_delta_step = 0.0, 
  subsample = 1.0, 
  colsample_bytree = 1.0, 
  colsample_bylevel = 1.0, 
  colsample_bynode = 1.0, 
  lambda = 1.0, 
  alpha = 0.0, 
  tree_method = "auto", 
  sketch_eps = 0.03, 
  scale_pos_weight = 1.0, 
  updater = nothing, 
  refresh_leaf = 1, 
  process_type = "default", 
  grow_policy = "depthwise", 
  max_leaves = 0, 
  max_bin = 256, 
  predictor = "cpu_predictor", 
  sample_type = "uniform", 
  normalize_type = "tree", 
  rate_drop = 0.0, 
  one_drop = 0, 
  skip_drop = 0.0, 
  feature_selector = "cyclic", 
  top_k = 0, 
  tweedie_variance_power = 1.5, 
  objective = "automatic", 
  base_score = 0.5, 
  watchlist = nothing, 
  nthread = 8, 
  importance_type = "gain", 
  seed = nothing, 
  validate_parameters = false, 
  eval_metric = String[])

We split our dataset into features and target.

In [98]:
y, x = unpack(dft, ==(:class); rng=123)
Out[98]:
(CategoricalArrays.CategoricalValue{Int64, UInt32}[0, 0, 1, 1, 1, 0, 0, 0, 1, 1  …  0, 1, 0, 1, 0, 1, 0, 0, 1, 1], 56892×5 DataFrame
   Row │ elevationl  roads  landcover  ecotone  nlandcover 
       │ Float32     Int64  Cat…       Cat…     UInt8      
───────┼───────────────────────────────────────────────────
     1 │   2051.35       1  30         0                 1
     2 │    447.443      3  10         0                 1
     3 │    469.101      3  10         0                 1
     4 │    348.177      6  50         0                 3
     5 │    530.609      2  10         0                 1
     6 │   2549.0        0  60         0                 2
     7 │    975.532      0  10         0                 1
     8 │   1362.45       0  30         0                 2
     9 │    564.383      2  30         0                 2
    10 │    450.751      4  10         1                 2
    11 │   1136.59       1  30         0                 2
   ⋮   │     ⋮         ⋮        ⋮         ⋮         ⋮
 56883 │    NaN          0  10         1                 2
 56884 │    503.322      4  10         0                 1
 56885 │    316.621      0  10         0                 1
 56886 │    471.01       3  10         0                 1
 56887 │   2351.03       0  30         0                 1
 56888 │    560.002      5  10         0                 1
 56889 │   1804.16       0  10         1                 2
 56890 │    381.971      1  30         0                 1
 56891 │    699.294      3  30         0                 2
 56892 │    430.053     10  50         0                 1
                                         56871 rows omitted)

And set up and evaluate the model on this dataset.

In [99]:
mach = machine(tree, x, y, scitype_check_level=0)
Out[99]:
untrained Machine; caches model-specific representations of data
  model: XGBoostClassifier(test = 1, …)
  args: 
    1:	Source @777 ⏎ Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Multiclass{10}}, AbstractVector{Multiclass{2}}}}
    2:	Source @766 ⏎ AbstractVector{Multiclass{2}}
In [100]:
pe = evaluate!(mach, resampling=StratifiedCV(; nfolds=6,
                               shuffle=true),
                 measures=[balanced_accuracy, kappa],
                 verbosity=1)
Evaluating over 6 folds: 100%[=========================] Time: 0:00:10
Out[100]:
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌─────────────────────┬──────────────┬─────────────┬─────────┬──────────────────
│ measure             │ operation    │ measurement │ 1.96*SE │ per_fold        ⋯
├─────────────────────┼──────────────┼─────────────┼─────────┼──────────────────
│ BalancedAccuracy(   │ predict_mode │ 0.873       │ 0.00389 │ [0.877, 0.876,  ⋯
│   adjusted = false) │              │             │         │                 ⋯
│ Kappa()             │ predict_mode │ 0.745       │ 0.00779 │ [0.753, 0.752,  ⋯
└─────────────────────┴──────────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted

These numbers look quite good (and it is fast!), but ideally one would further investigate the feature importances and check whether these evaluation metrics behave as they should.

Prediction¶

With our trained model we can now make a prediction for the whole of Switzerland.

In [101]:
dfn = DataFrame(extract(rs, DimPoints(rs)))  # all values of the rasterstack as a row of points with attributes
select!(dfn, Not(:geometry))
dfn .= ifelse.(isinf.(dfn), 0.0, dfn)  # hard to predict with Inf
pred = predict_mode(mach, dfn)
Out[101]:
7712900-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

We can reshape this answer into a Raster again and make sure it adheres to the requirements (0=false, 100=true).
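On toy numbers, the reshape-and-rescale step looks like this (hypothetical 2×3 grid):

```julia
# CategoricalArray levels come back as 1-based integers; map them to 0/100.
pred = [1, 1, 2, 1, 2, 2]          # flat per-cell class labels (toy values)
grid = reshape(pred .- 1, (2, 3))  # back to the raster's shape (column-major)
out  = grid .* 100                 # hackathon convention: 0 = false, 100 = true
```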

In [106]:
pr = Raster(int.(reshape(pred, size(rs)[1:2])), dims=dims(rs))
pr .-= 1  # Converting the CategoricalArray to int starts with 1
pr[mask.==255] .= 0  # Set values outside of Switzerland to false
pr .*= 100
Rasters.write("prediction.tif", pr, force=true)
plot(pr)
┌ Warning: `missing` cant be written to .tif, missinval for `UInt32` of `4294967295` used instead
└ @ Rasters ~/.julia/packages/Rasters/47MWf/src/utils.jl:32
Out[106]:

The above map is binary and will score badly in the hackathon (which uses log loss). Using the predicted probability of each class instead gives us exactly what we need.
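A toy calculation shows why probabilities beat hard labels under log loss (numbers are illustrative only):

```julia
# Binary log loss for a single prediction p of the true label y ∈ {0, 1}.
logloss(p, y) = -(y * log(p) + (1 - y) * log(1 - p))

# A near-certain "yes" on a cell that has no tick is punished heavily,
# while a hedged 0.7 on the same cell costs far less:
hard   = logloss(0.999, 0)   # ≈ 6.9
hedged = logloss(0.7, 0)     # ≈ 1.2
```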

In [107]:
ppred = MLJ.predict(mach, dfn)  # note the difference with predict_mode above
ppr = Raster(reshape(pdf(ppred, [1]), size(rs)[1:2]), dims=dims(rs))  # pdf(ppred, [1]) gives us the probability for class 1
pprr = round.(UInt8, ppr * 100)  # to percentage in UInt8 instead of 0.x Float
pprr[mask.==255] .= 0  # Set values outside of Switzerland to false
Rasters.write("pprediction.tif", pprr, force=true)
plot(pprr)
┌ Warning: `missing` cant be written to .tif, missinval for `UInt8` of `255` used instead
└ @ Rasters ~/.julia/packages/Rasters/47MWf/src/utils.jl:32
Out[107]:

Let's zoom in to see how we're doing.

In [143]:
using GeoInterface, Extents
chur = view(pprr, X(2740000..2760000), Y(1170000..1190000))
x = plot(chur)
plot(x, tdf.geom[1:1000], c="white")
Out[143]:

Overall this map looks pretty good; it seems most of the prediction is based on the road network. There are some buffered lines that might be the forest edge, and some noisy predictions, which might be smaller patches of either urban or open landcover (like pastures).